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Abstract 



o 

Respondent-Driven Sampling (RDS) is an approach to sampling 
Qh design and inference in hard-to-reach human populations. Typically, 

ry^ a sampling frame is not available, and population members are dif- 

^ ficult to identify or recruit from broader sampling frames. Common 

examples include injecting drug users, men who have sex with men, 
and female sex workers. Most analysis of RDS data has focused on es- 

|~t"| timating aggregate characteristics, such as disease prevalence. How- 

ever, RDS is often conducted in settings where the population size is 
unknown and of great independent interest. This paper presents an 
approach to estimating the size of a target population based on data 

^ collected through RDS. 

The proposed approach uses a successive sampling approxima- 

y—( tion to RDS to leverage information in the ordered sequence of ob- 

served personal network sizes. The inference uses the Bayesian frame- 
work, allowing for the incorporation of prior knowledge. A flexible 

CN class of priors for the population size is proposed that aids elicitation. 

An extensive simulation study provides insight into the performance 
of the method for estimating population size under a broad range of 

^ conditions. A further study shows the approach also improves esti- 

i— I mation of aggregate characteristics. A particular choice of the prior 

j> produces interval estimates with good frequentist properties. Finally, 

ITj the method demonstrates sensible results when used to estimate the 

^ numbers of sub-populations most at risk for HIV in two cities in El 

Salvador. 
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1 Introduction to Respondent-Driven Sampling 



Respondent-Driven Sampling (RDS, introduced by Heckathorn (1997)) is 
an approach to sampling from hard-to-reach human populations in the in- 
terest of conducting statistical inference, typically on population propor- 
tions. In such hard-to-reach populations, a sampling frame for the target 
population is not available, and members are difficult to identify or re- 
cruit from broader sampling frames. In public health, RDS is often used 
in studies of high-risk populations such as injecting drug users, men who 
have sex with men, and female sex workers; 128 of these studies are sum- 
marized in Johnston et al. (2008). RDS has also been used in other popula- 
tions such as jazz musicians (Heckathorn and Jeffri, 2001) and unregulated 
workers (Bernhardt et al., 2006). 

RDS is a form of link-tracing network sampling, in which subsequent 
sample members are selected from among the social relations of current 
sample members. Unlike most link-tracing designs, respondent-driven sam- 
pling relies on study respondents to choose which of their contacts will be 
sampled next. Each respondent is given a small number of uniquely iden- 
tified coupons to distribute among their contacts in the target population. 
Contacts receiving coupons become eligible for the study. 

Most existing estimators from RDS data attempt to estimate popula- 
tion proportions 

(Gile, 2011; Gile and Handcock, 2011; Heckathorn, 1997, 2002; Salganik 
and Heckathorn, 2004; Volz and Heckathorn, 2008). Population size esti- 
mation based on RDS data is also of interest for three reasons: First, these 
data are often collected in precisely the populations in which there is in- 
terest in population size. In fact, RDS-based prevalence estimates are of- 
ten used in the Estimation and Projection Package (EPP) model used by 
UN AIDS (UN AIDS, 2009). For concentrated epidemics, EPP estimates na- 
tional HIV rates based on both prevalence and population size estimates 
for several high-risk populations. The resulting estimates of the numbers 
of infections are used in decisions about resource allocation, research de- 
sign and intervention planning (UNAIDS and World Health Organization, 
2010). Second, new prevalence estimators for RDS (Gile, 2011; Gile and 
Handcock, 2011) require estimates of the size of the population. And fi- 
nally, because the information in the sequence of RDS samples has not yet 
been exploited to estimate population size, this approach introduces a new 
source of information on the size of the hard-to-reach population. 
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There are many approaches to estimating the size of a hard-to-reach 
human population (Bao et al., 2010; Berchenko and Frost, 2011; Paz-Bailey 
et al., 2011; UNAIDS and World Health Organization, 2010), including a 
few that use link-tracing samples similar to RDS (Felix-Medina and Thomp- 
son, 2004; Frank and Snijders, 1994). The validity of these approaches typ- 
ically rests on strong assumptions about the populations and adherence to 
sampling designs. A common approach is to use capture-recapture sam- 
pling (Fienberg et al., 1999; Paz-Bailey et al, 2011; Rocchetti et al, 2011), 
which estimates population size based on the overlap between two or 
more captures of population members. Related approaches include the 
(network) scale-up or multiplier methods. In these, the captures may not 
be based on a sampling design but may be from enumerations or conve- 
nience samples. The accuracy of multiplier methods varies with the qual- 
ity of the captures data (Salganik et al., 2011; UNAIDS and World Health 
Organization, 2010). In particular, Fienberg et al. (1999) present a general 
approach to population size estimation using multiple-recapture data and 
develop a sophisticated Bayesian estimation approach for it. Because they 
are based on multiple captures, to date, all methods that use RDS data re- 
quire additional data collected by mechanisms other than RDS (Bernard 
et al., 2010; Johnston et al, 2011; Niccolai et al, 2010; Salganik et al., 2011). 
The primary contribution of this paper is to introduce an estimator of pop- 
ulation size based solely on RDS data. This approach is also novel in that 
the population size estimates are based on information in the sample se- 
quences, exploiting the dependence in the sampling process. In contrast, 
most inference from sampled data relies directly on the sampled values 
and treats dependence in the sample as a nuisance. The proposed esti- 
mator can also be combined with estimates based on other approaches to 
produce improved inference about the population size. 

The proposed approach is founded on the successive sampling approx- 
imation to the RDS process introduced in Gile (2011). It extends the ap- 
proach developed by West (1996) for ecological applications (e.g., estimat- 
ing the number of oil fields based on the sizes of the known fields). Un- 
der successive sampling, larger units tend to be sampled earlier. This ap- 
proach therefore leverages the information in the decreasing size of sam- 
pled units (a function of oil reserve magnitude for West's application, and 
social connectedness for RDS) over time to make inference about popula- 
tion size. Like West, it uses a super-population model-based formulation 
within a Bayesian inferential framework by positing a prior distribution 
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over population size. It differs from that of West in three key ways: First, 
the unit sizes are modeled as discrete rather than continuous. Second, the 
branching and network nature of the RDS sample may reduce or confound 
the information in the ordering of the sample. Third, the sample sizes of 
RDS samples are typically larger, and with a different range of unit sizes 
than in the data available in ecological applications such as oil fields. 

The next section (Section 2) develops the inferential framework and a 
flexible class of priors for the population size. In Section 3, a simulation 
study illustrates the frequentist performance of the population size esti- 
mator derived from the Bayesian framework, as well as the performance 
of a prevalence estimator based on this estimator. Section 4 applies the 
proposed method to data collected on two most-at-risk populations in El 
Salvador. Section 5 concludes the paper with a broader discussion. 

2 Bayesian Inference for the Population Size 

The goal here is to make inference for population size N. The approach 
taken is Bayesian, treating iV as an unknown parameter. This requires a 
probability model for the observed data given N, as well as a prior for N. 
Most information about the population size is drawn from the pattern in 
the sampling process. In particular, this sampling model is non-amenable 
to the model (Handcock and Gile, 2010). For this reason, the probability 
model must represent both the sampling structure, and a superpopulation 
model. 

The distribution of the sampling process of the units is modeled as a 
function of their unit sizes. The sampling model, described in Section 2.2 
below, follows Gile (2011) and is based on a successive sampling approx- 
imation to the RDS process. The super-population model for these unit 
sizes is given in Section 2.5. A likelihood function for iV can be computed 
from these two models and then combined with a prior to make inference 
for N. 

The inferential frame is described as follows: Section 2.1 introduces the 
form of the likelihood. Section 2.2 adds the particular form of the sam- 
pling distribution based on successive sampling. Section 2.3 introduces 
the Bayesian frame for inference for the parameter of the unit size dis- 
tribution, which is extended in Section 2.4 to inference for N. Section 
2.5 presents the parametric model for the unit size distribution. Finally, 
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Section 2.6 presents the forms of the prior distributions for the super- 
population model and the population size. 

2.1 Likelihood for the Super-population Parameter 

Consider a population of N units, denoted by indices 1, . . . , TV with an 
associated variable unit size represented by U\, C/ 2 , • • • , Un- For RDS, unit 
sizes are often the numbers of network connections, also known as per- 
sonal network sizes or degrees, but they can be any function of individ- 
ual unit variables. The unit sizes are treated as an i.i.d. sample of size 
N generated from a super-population model based on some (unknown) 
distribution. For simplicity of presentation, the unit sizes are presumed 
to have the natural numbers as their support (e.g., degrees). Specifically: 

Ui"~'f(-\r)) where f(-\rj) is a probability mass function (PMF) with support 
1, . . . , and f) is a parameter. 

Consider first a general ordered sampling design. The indices of the se- 
quentially sampled units identify the order of the sample, denoted by the 
random vector G = (G± : . . . , G n ), where realized values g = (g±, . . . , g n ) 
represent the observed indices of the successively sampled units. Let \g = 
{1, . . . , N}\{gi, ... ,g n } represent the set of indices of the unobserved pop- 
ulation units. Further, consider the observed and unobserved unit sizes. 
Let Uobs = {U gi , U 92) . . . , U 9n }, the random vector of observed unit sizes, 
with values u obs = (u 9l ,. . .,u 9n ). Similarly, let U unobs = {Ui} ie \ g and u unobs = 
{ui} i( z\ g represent the random and realized values of the unit sizes of the 
unobserved units. Thus the full observed data are {g, u obs }. 

Inference for 77 should be based on all the available observed data in- 
cluding the sampling sequence information. The likelihood is any function 
of 77 proportional to p{G, U obs \rj): 

L[v\Uobs = u obs , G = g] oc p(G = g, U obs = u obs \r]) 

= ^2 P ( G = 9> U = ( U obs,Uunobs)\v) 

n 

Yl P( G = 9\U = (u obs ,u unobs ))Y[f(u 9j \r)) ■ Y[f(uj\r]),(l) 

v- uno bs£U(u obs ) j=l je\g 

where U(u obs ) is the set of possible u unobs given u obs , that is the unit sizes 
possible for the remaining N — n units given that the first n sampled were 
u obs . Typically, this will be the N — n product support of f(-\rf). Thus the 
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correct model is related to the complete data model through the sampling 
design as well as the super-population model. 



2.2 Likelihood Under Successive Sampling 

Following Gile (2011), the RDS sampling is approximated as a successive 
sampling process. Gile argues that this model approximates a without- 
replacement random walk on the network, and demonstrates that using 
this model can reduce finite population biases for RDS estimates of pop- 
ulation characteristics. This sampling scheme is also known as probability 
proportional to size without replacement (PPSWOR) sampling, and is treated 
in the survey sampling and ecological literature (Andreatta and Kaufman, 
1986; Bickel et al, 1992; Nair and Wang, 1989). The Successive Sampling (SS) 
sampling procedure is defined as follows: 

• Sample the first unit from the full population {1, . . . , N} with proba- 
bility proportional to unit size u u i = 1, .. . , N: p(Gi = k) = u k /Y^ = iUj, k = 
1,...,N. 

• Select each subsequent unit with probability proportional to unit size 
from among the remaining units, such that 

P iG i = k\G 1 =g 1 ,...,G^ 1 = g i . 1 ) = \ ,i = 2,...,n. 

otherwise 

(2) 

The probability of the observed sequence g for a given population of 
unit sizes is: 



p(G = g\U = (u obs , u unohs )) = ' TT 

(N -n)\ r k 

N k-1 



where 



r 'k = ^2 u i-^2 u g 3 k = l,...,n, (3) 

i=l j=l 

so that the full likelihood is: 

L[v\U bs = u obs , G = g] oc p(G = g, U obs = u ohs \v) (4) 

AT\ n n 
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This likelihood can be the basis of maximum likelihood estimation for rj. 
In general, this sum will be very difficult to compute because of the N — n 
embedded sums over typically infinite supports of f(-\rj). 

Note that this likelihood involves models for both the sampling design 
and the super-population, necessary because the design is not amenable 
to the model (See the Supplementary Materials for details). Intuitively, 
ignoring the sampling distribution would likely result in positive bias in 
inference about unit sizes as the larger-sized units will tend to be sampled 
first. 



2.3 Bayesian Inference for the Unit Size Distribution 

This section develops inference for the unit size distribution, conditional 
on known N. 

p(r}\G = g, U obs = u obs ) oc 7r(r])-L[r]\U obs = u obs , G = g], 
where ir(i]) is a prior for the unit size distribution parameter. For simplicity 
of notation, denote the observed data by D = (U obs = u obs , G = g). 

Because of the complexity of computing the likelihood (4), West (1996) 
suggests using the relatively simple 

N\ n u n 

p(G = g obs , U obs = u obs , U unobs = u unobs \r]) = JJ f(u 9j \rj) ■ JJ f(uj\ri) 

y n >- k=l Tk j=l je\g 

(5) 

and a two component Gibbs sampler with p{i]\D,U unobs = u unobs ) and 
p(U unobs = u unobs \r], D). The current paper uses a variant of this approach 
for discrete unit size distributions. 
From (5), 

n 1 

p(U unobs = u unobs \r),D) oc Yl — JI f( u j\v)- (6) 

k=i Th je\g 

As the r k are hard to deal with, West (1996) notes that, 

1 f°° 

-= / e - r ^di> k , 
r k Jo 

where ipk has exponential distribution with rate parameter r k . That is, 

p(ip k = ip\r], U unobs = u unobs , D) = r fc exp(-r fc ^) ip > 0, (7) 
He augments the data with ^ = . . . , ifj n ) where the components are 
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drawn (conditionally) independently so that 

p{U U nobs = u unobs , ty\r), D) = = ip\r], U unobs = u unobs , D) ■ p(U unobs = u unobs \r], D) 

n 

oc \{e^ J] f( Uj \ V ) 
j =1 j=e\g 

and from (3), 

p(U unobs = U unobs \ty, ?7, D) OC = lfj\f], U unobs = U unobs , D) ■ p{U unobs = U unobs \l], D) 



n 



oc JJ e -^£i=€\*«i . JJ e -^EJ=i« flj - . Yl f( Uj \rj) 

1=1 i=l i = 6\fir 

oc JJ e-^ ^=1^/(^177). (8) 
je\g 

Hence the elements of U unobs are conditionally an i.i.d. sample from the 
unnormalized PMF e~"^=i ^/(-u|r/), and are, in fact, conditionally inde- 
pendent of D. 

The augmented posterior: 

p(v, U unobs = u unobs , V\D) (9) 
can then be easily computed via a three component Gibbs sampler. Details 
of this and an explicit statement of the MCMC algorithm are given in the 
Supplementary Materials. 

The algorithm produces samples from p(r)\D) and from the posterior 
predictive distribution for the unobserved unit sizes: p{U unobs = u unoba \D). 
These in turn enable inference for such quantities as the mean unit size, 
the unit size distribution, etc. 



2.4 Estimating the Size of the Hidden Population 

When N is unknown, it becomes an additional parameter to be estimated. 
For simplicity, specify that iV and 77 are a priori independent so that rc(N, rj) = 
ir(N)-7i(i]). A variant of the approach in the last section allows draws from 
the joint posterior 

p(N, 77, Uunobs = U unohs \D). (10) 
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This change requires p(N, rj,^\D). Using (5) and (7), 

p(D, U unobs = u unobs \N, 77) 
oc p(ty = ip\N,rj,D,U unobs = u unobs ) -p(D, Uunobs = u unobs \N,r]) 



(11) 



oc 



(N — n)\ 
TV! 



■\{u g J{u gi \ri)\{e-^=^ - \{e-^=^ . J] f( Uj \rj) 

j=e\g 



i=i 



oc 



Ms 



(12) 



(N-n)\ i , 

The full-conditional for iV is 

p(JV|77,*,£))oc7r(JV)p(£)|7V,77,*)=7r(7V) ^ p(D, U unobs = u unobs \N, 

u unobs 

eU(u obs ) 



oc 



oc 



oc 



(N — n)\ 

M 
(N-n)\ 

N\ 
(N-m)\ 



n 



TT(iV) £ 

W U no6s€t/(u oi , s ) l:MS 

AT I 00 

j=fl+l {Vj=l 

N-n 



ir(N) 



i=i 



where 7(0,77) = J^e 



The other full-conditionals are unchanged. This leads to a four component 
Gibbs sampler, the details of which are given in the Supplementary Mate- 
rials. The algorithm can be run to produce a large sample from the aug- 
mented posterior: p(N, rj, U unobs = u unobs , ^\D). This can then be marginal- 
ized to produce samples from p(N\D), p(rj\D), and the posterior predictive 
distribution of the unobserved unit sizes, p(U unobs = u unobs \D). Hence it 
produces posterior predictive distributions of the full population of unit 
sizes (ui,i = 1,...,N). These posteriors enable inference for such quan- 
tities as the population size, the mean unit size, the unit size distribution, 
etc. 



2.5 Models for the Unit Size Distribution 

This approach is general with respect to the parametric distribution of unit 
sizes, and several parameterizations are included in the software (Hand- 
cock, 2011). The question of models for the degree distributions of social 
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networks has been extensively studied. Discussion of the many alternative 
distributional families is given in the Supplemental Materials. Through- 
out the study of the synthetic populations and application to real data the 
Conway-Maxwell-Poisson distribution for the unit sizes is used. It allows 
both under-dispersion and over dispersion with a single additional pa- 
rameter over a Poisson (Shmueli et al., 2005). 

2.6 Prior specification 

2.6.1 Prior for the unit size distribution model 

Each two-parameter unit size distribution, including the Conway-Maxwell- 
Poisson, can be parameterized in terms of its mean and standard devia- 
tion. The prior for the mean given the standard deviation is normal and 
the variance is scaled inverse Chi-squared: 

fi\a ~ N(fi , cr/df mcan ) a ~ Invx(o" ; df sigma ). 

The default prior on these parameters is close to uninformative, equivalent 
sample size of df mcim = 1 for the mean of the unit size distribution and 
df S igma = 5 for the variance of the unit size distribution. 

2.6.2 Prior for the population size 

The model allows for an arbitrary prior distribution over the population 
size (N). However, this is an opportunity to choose priors that aid elic- 
itation of expert prior information or easily incorporate previous or con- 
comitant sources of information about the population size. 

The most common parametric models for N (e.g., Negative Binomial) 
typically have too thin tails for large N. This issue has been thought 
through by Fienberg et al. (1999). They suggest the prior: 

n(N) = (N-l)\/N\ for n < N < N max , (14) 
where N max covers the range where the likelihood is non-negligible. For 
their applications they choose their Jeffrey's prior I = 1, ir(N) oc 1/7V, n < 
N < N max . In addition to these possibilities, we propose a new class of 
priors specifying knowledge about the sample proportion (i.e. n/N) as a 
Beta(a, (3) distribution. The implied density function on iV (considered as 
a continuous variable) is: 

7r(jV) = f3n(N - nf- l /N a+p for N > n. (15) 
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The distribution has tail behavior 0(l/N a+l ). Figure 1 presents three dif- 
ferent versions of this prior, corresponding to a prior mean, median and 
mode of 1000. We have found this class of priors to be very useful: It is 
often relatively flat in regions where the likelihood is centered. The long- 
right-tail allows large population sizes but the rate of decline ameliorates 
this. 

When a — I — 1 > 0, this class is similar to that of Fienberg et al. (1999). 
The Beta prior class, however, is directly motivated as a proper prior on 
the sample proportion. Additional details on this prior are given in the 
Supplemental Materials. 







A mean=1000 

/ \ median=1000 

\ mode=1000 









500 1000 1500 2000 2500 

Population size (N) 



Figure 1: Three example prior distributions for the population size (TV). 
They correspond to a = 1 and (3 = 1.55, 1.16 and 3. 



3 A Simulation Study to Assess Frequentist Prop- 
erties 

The primary focus of the simulation study is to evaluate the frequentist 
performance of the proposed estimator for population size, including point 
estimation, interval estimation, and sensitivity to prior specification. Of 
secondary interest is the use of these estimates to inform the estimation of 
population proportions using the estimator introduced in Gile (2011). 
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The parameters of the study are largely chosen for consistency with 
the simulation studies of RDS-based estimators of population proportions 
in Gile and Handcock (2010), Gile (2011), and Tomas and Gile (2010), al- 
though with a greater range of population sizes. As in these studies, to 
increase the realism of the study, parameters match the characteristics of 
the pilot data from the CDC surveillance program (Abdul-Quader et al., 
2006) whenever possible. The general procedure is: (1) 200 Networks are 
simulated under each test condition; (2) An RDS sample is simulated from 
each sampled network; (3) Point and interval estimators of population size 
are computed from each sample. 

For the simulation, all samples are of size 500, and the population mean 
degree is fixed at 7. A discoverable class, referred to as "infection status," 
is assigned to each member of the population such that each population 
has prevalence 20%. 

The varying characteristics of the synthetic populations are also cho- 
sen to represent those expected in the real world. These characteristics 
include population size (i.e., the number of nodes), tendency for individu- 
als to preferentially form relations with others of the same infection status 
(known as homophily), and different rates of network connectivity by in- 
fection status (referred to as differential activity). 

Population network structures are modeled using Exponential family 
Random Graph Models (ERGM) (Snijders et al., 2006). That is, the N x N 
binary matrix of relations, y, is represented as a realization of the random 
variable Y with distribution: 

P V (Y = y\x) = ex.p{rj-g(y, x) - k(t], x)} y ey, (16) 
where x are covariates, g(y,x) is a p-vector of network statistics, rj e W 
is the parameter vector, y is the set of all possible undirected graphs, and 
exp{/t(?7, x)} — J2 u &y ex P{V'9( u , x )} is t ne normalizing constant (Barndorff- 
Nielsen, 1978). 

This modeling framework can represent a very wide range of popula- 
tions, with the particular structures determined by the choice of g(y,x). 
Here, differential activity is parameterized as the ratio, cu, of the mean 
degree of infected nodes to the mean degree of uninfected nodes, where 
uj — 1 represents the absence of differential activity. While homophily 
can and will occur on multiple variables, the most impactful type is that 
on infection status. Homophily is parameterized, for fixed mean degree of 
each group, as the ratio, a, of the expected number of discordant-infection- 
status ties absent homophily to the expected number of discordant-infection- 
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status ties with the homophily: 

E (number of infected-uninfected ties absent homophily) 
E (number of infected-uninfected ties) 
so that larger values of a indicate more homophily This measure is mean- 
ingful across different levels of differential activity Note that this param- 
eterization of homophily is different from that in earlier studies (e.g. Gile 
and Handcock, 2010). 

These features are represented in the ERGM by choosing network statis- 
tics to represent the mean degree, the relative activity levels of the two 
groups, and homophily. The binary nodal covariate Xi represents infection 
status, such that xi — 1 indicates infection. These three parameters then 
map to the expected cell counts of the mixing matrix on infection status. 
Our networks are thus simulated from an ERGM with 

N N N 

g(x,y) = {^^XiXjyij^^Xiii - x^yij^^ii - Xi)(i - x^yij}. 

i=l j^i i=l j^i i=l j^i 

The range of population characteristics modeled is (a) population size: 
N e {5000,1500, 1000,750,555}; (b) differential activity: to e {0.5,1,2}; 
and (c) homophily: a G {1, 1.8}. The i] parameter of the ERGM is chosen 
so the expected values of the statistics are equal to the values given above, 
and the simulated networks are generated from the resulting model, as 
in van Duijn et al. (2009). This was implemented using the R package 
statnet (Handcock et al., 2003). 

Ten seed nodes are chosen for each sample, selected (sequentially) with 
probability proportional to degree. Subsequent sample waves are selected 
without-replacement by sampling two nodes (where possible) at random 
from among the unsampled alters of each sampled node. This typically re- 
sulted in four complete waves and part of a fifth wave, stopping at sample 
size 500. 

Throughout the simulations, unit sizes are modeled with a Conway- 
Maxwell-Poisson distribution with diffuse priors (hyper parameters /xo = 
7, rf/ m ean = 1, cr = 3, ^/sigma = 5). The prior for the population size is a Beta 
distribution with a = 1 as describe in (15). Sensitivity of the method to 
incorrect priors is tested with priors based on three different prior means: 
equal to the true population size, N, a high estimate equal to 2N, and a 
low estimate halfway between N and the sample size n: (N + n)/2. 

Results from each simulation are summarized using the posterior mean 
as a point estimate, and the 95% highest posterior density region as an 
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interval estimate. 



3.1 Point and Interval Estimation of Population Size 

Figure 2 summarizes population size estimates based on simple network 
structures with no homophily (a = 1) or differential activity (tu = 1) for 
all five population sizes (corresponding to different sample fractions), and 
low, accurate, and high prior estimates of population size. 



1.1 



96 96 



527.5 555 1110 625 750 1500 



750 1000 2000 

Prior Mean 



.3712 100 



1000 1500 3000 2750 5000 10000 



Figure 2: Spread of central 95% of simulated population size estimates 
(posterior means) for 5 population sizes for low, accurate, and high priors. 
Dots represent means. Estimates are represented as multiples of the true 
population size (red line at 1 indicates true population size). Numbers be- 
low the bars are coverage rates of 95% HPD intervals, numbers above the 
bars indicate the widths of these intervals (ratio of median upper bound 
to truth). 
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When the prior is correct (blue lines), average point estimates are rea- 
sonably close in all cases. There is a small amount of positive bias. This 
is because of the successive sampling (SS) approximation to the true link- 
tracing network sampling process. In SS, the next unit sampled would be 
chosen with probability proportional to unit size from among all unsam- 
pled units. In RDS, the network structure constrains the selection of each 
subsequent unit, with the effect that the decrease in sampled unit sizes 
over time is less sharp than in successive sampling, leading to a slight 
positive bias in the population size estimates. The coverage rates of nom- 
inal 95% credible intervals are about right in the case of accurate prior 
information. 

Because there is limited information regarding population size in the 
RDS samples, results are affected by the choice of prior mean, with greater 
impact for smaller sample fractions. This is because smaller sample frac- 
tions entail less exhaustion of the target population and therefore less in- 
formation in the data about population size. 

The coverage rates for the 95% HPD regions for cases of prior mis- 
specification range from 83% to 100%, with higher coverage rates for higher 
prior means. These intervals can be quite wide. Because of the lower 
bound induced by the sample size, interval width is largely determined 
by the upper limits. The numbers above the bars in Figure 2 represent the 
median across each set of 200 simulations of the upper limit of the HPD 
intervals, represented as a multiple of the true population size. When the 
population size is close to the sample size, intervals are quite tight, while 
smaller sample fractions yield intervals that are often very large, with me- 
dian upper point 3.2 times the true population size for iV = 1500, and even 
wider (5.7 times) with higher prior mean of 3000. 

3.2 Impact of Network Structure 

Figure 3 summarizes the results of varying the levels of homophily and 
differential activity, for varying prior specifications, all for populations of 
size 1000. Each pair of columns provides comparison across two levels of 
homophily (a). The paired columns are very similar, in both point and in- 
terval estimates, and across all levels of differential activity. This suggests 
that under a broad range of circumstances, homophily does not have a 
strong first-order impact. 

There does appear, however, to be a first order impact of differential 
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Figure 3: Spread of central 95% of simulated population size estimates 
(posterior means) for population size 1000 for low, accurate, and high pri- 
ors, with varying levels of homophily (a) and differential activity (a;). The 
legend is the same as Figure 2. 

activity (u = 0.5 and to = 2), across levels of homophily By systemat- 
ically varying mean degree across infection groups, differential activity 
increases the variation in the unit size distribution, increasing the rate of 
decline in sampled unit sizes, and therefore providing more information 
about population size, resulting in better point estimates. Note how the 
prior mean 2000 cases have point estimates far closer to the truth when 
<jj = 2 as compared to w = 1. The credible intervals (HPDU ratios) are 
also typically smaller for u ^ 1 cases, without substantial reduction in the 
coverage rates. 
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3.3 Estimation of Population Proportions 



RDS is typically conducted in the interest of estimating population fea- 
tures such as population proportions. Earlier estimators based on RDS 
data assumed the population size was very large with respect to the sam- 
ple size, so that finite population effects could be ignored. The more recent 
estimator which introduces the successive sampling (SS) approximation 
on which this paper is based (Gile, 2011), however, includes a finite pop- 
ulation adjustment, but assumes that the population size is known. It is 
natural, therefore, to use the approach to population size estimation intro- 
duced in this paper to provide a population size estimate for use in the 
prevalence estimator in Gile (2011). Hence this section considers estimates 
of infection prevalence using the SS estimator. This section compares re- 
sults using the prior mean as the population size to results using the pos- 
terior mean. 

Figure 4 shows the SS results for the same simulation conditions as 
Figure 3. Absent differential activity (Figure 4, middle 2 columns), there 
is little difference between results using the prior and posterior means. 
This is because the SS estimator re-weights the sample based on unit sizes 
(degrees) and the assumed population size. Absent differential activity, 
the infected and uninfected subsamples will have similar degree distribu- 
tions, and therefore be similarly affected by any aberrations in the unit 
weights. 

In the presence of differential activity, inaccurate population size esti- 
mates introduce bias in the point estimates given by the successive sam- 
pling estimator (see Gile, 2011). The first and last two columns of Figure 
4 compare prevalence estimates based on the prior and posterior means 
for cases with strong differential activity. Consistent with Gile (2011), the 
dashed bars, corresponding to estimates using the prior mean, show sub- 
stantial bias when the population size is inaccurate. This is due to im- 
perfect adjustment for finite population effects. The primary advantage 
of estimates based on the posterior mean is the reduction of this dramatic 
bias. The cost of this reduction, however, is increased variance of the esti- 
mator, resulting in higher MSE in cases with small finite population biases, 
such as when the prior mean is correct. Note that because the bootstrap 
standard error estimator associated with the successive sampling estima- 
tor does not account for uncertainty in the population size, coverage rates 
can be dramatically low, especially for the estimator based on the prior 
mean. 
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Figure 4: Spread of central 95% of simulated prevalence estimates for pop- 
ulation size 1000, with varying levels of homophily (a) and differential 
activity (cu). Solid lines represent prevalence estimates based on the pos- 
terior mean, dashed lines represent comparable estimates using the prior 
mean. Relative efficiency (MSE posterior /MSE prior) is given above each 
bar, and the coverage of nominal 95% confidence intervals is below each 
bar. The true prevalence is 0.2 (blue line). 

4 Application: Estimating the numbers of those 
most at risk for HIV in Cities in El Salvador 

Overall, El Salvador is a country with low HIV prevalence. As of 2010, 
the adult HIV prevalence was estimated at 0.8% (UNAIDS/WHO, 2010). 
However the virus remains a significant threat in groups who practice 
high-risk behaviors, such as female sex workers (FSW) and men who have 
sex with men (MSM) (Morales-Miranda et al, 2007; Soto et al, 2007). 
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This section analyses two studies from data collected in 2010 as part of 
a series of RDS studies of populations most at risk for HIV across major El 
Salvadorian cities (Paz-Bailey et al., 2011). 

The first case is a study of FSW in the department of Sonsonate, which 
had a population of about 540,000 in 2008 (Guardado et al, 2010). This 
study began with 5 seeds and continued until wave 9, with 11 samples 
from wave 8 and 5 from wave 9, and a total of 184 samples. The average 
wave number was 3.8. A graph of the recruitment is given in the Supple- 
mentary Materials. 

As is typical in these settings, the number of female sex workers in 
Sonsonate is unknown. The public health officials use the UNAIDS na- 
tional HIV estimation work group recommendation to estimate the num- 
ber of FSW based on a percent from the total adult female population 
(UNAIDS/WHO, 2003). In 2009, this group estimated that FSW consti- 
tute 0.4%-0.8% of the urban female population 15-49 years of age (139,804) 
(de Estadistica y Censos, 2007; Paz-Bailey et al., 2011). The range for Son- 
sonate is then 560 to 1120 FSW. 

Estimates of the total number of FSW in Sonsonate are calculated with 
three different priors. The first prior is constant over the range of popula- 
tion sizes where the likelihood is non-negligible. Figure 5(a) plots both the 
prior and posterior distributions. The peakedness of the posterior shape 
indicates that there is information in the data about the population size, 
with a mode of around 1250 FSW. The UNAIDS guidelines (purple lines) 
fall in the mid to low part of the posterior distribution, are broadly consis- 
tent with it, and fall within the 95% HPD interval (blue lines). 

The second prior places the prior mean at the mid-point of the UN- 
AIDS suggested range 0.6% x 139804 = 838. Figure 5(b) plots this prior 
and the resulting posterior. The mean, median and mode of the posterior 
fall within the UNAIDS guidelines (purple lines) and these guidelines fall 
within the the 95% HPD interval. As expected, this prior results in more 
mass in the posterior in the area of the UNAIDS estimates. 

As UNAIDS provides a range of values, it may be useful to specify a 
prior based on multiple points in that range. The class of priors described 
by (15) allows the flexibility to choose a prior that reflects a range of values. 
In this case, two parameters (a, (3) are chosen so that the prior mean is the 
midpoint of the range (0.6%) and the lower quartile of the prior is the 
lower UNAIDS estimate (0.4%). Figure 5(c) plots this prior and resulting 
posterior distribution. This distribution has a mode at about 1000 FSW, 
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Figure 5: Posterior distribution for the number of female sex workers in 
Sonsonate based on three prior distributions for the population size: flat, 
matching the midpoint UNAIDS estimate, and interval-matching the UN- 
AIDS estimate. The prior is dashed. The red mark is at the posterior me- 
dian. The green mark is at the posterior mean. The blue lines are at the 
lower and upper bounds of the 95% highest-probability-density interval. 
The purple lines demark the lower and upper UNAIDS guidelines. 
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and a 95% HPD interval from 481 to 1998 FSW. 

The above analysis is now repeated for a study of MSM in San Miguel. 
In 2009, the UNAIDS national HIV estimation work group estimated the 
number of MSM in El Salvador at 2%-5% of the urban male population 15- 
49 years of age (148,489) (de Estadistica y Censos, 2007; Paz-Bailey et al., 
2011; UNAIDS/WHO, 2003). The range for San Miguel is then 2970 to 
7425 MSM. 

Figure 6 shows the same three prior specifications as the previous case. 
Figure 6(a) plots the posterior distribution and the prior when the prior is 
constant over the range of population sizes where the likelihood is non- 
negligible. The peakedness of the posterior shape again indicates that 
there is information in the data about the size, but it is diffuse and has 
a long upper tail compared to that for the FSW. The UNAIDS guidelines 
(purple lines) fall in the mid to upper part of the distribution, and are well 
within the 95% HPD interval (blue lines). 

Figure 6(b) plots the posterior distribution based on the prior with 
mean the mid-point of the UNAIDS suggested value 3.5% x 148489 = 5197. 
The majority of the posterior is below the UNAIDS estimates as the prior 
pulls in the larger values while the 95% interval mostly covers them. 

Figure 6(c) plots the posterior distribution based on the prior fixing the 
mean at the midpoint of the range (3.5%) and the lower point (2%) at the 
lower quartile. This prior contains the most information from the UNAIDS 
work group and hence is perhaps the best choice. The resulting posterior 
distribution has a mode at about 2100 MSM, and a 95% HPD interval from 
200 to 7048 MSM. Thus this method yields an estimate of the number of 
MSM in San Miguel with a wide interval. 

Another reference point for population size estimates in El Salvado- 
rian cities can be extrapolated from the results of another study in San 
Salvador. In this study, capture-recapture was used to estimate the num- 
ber of MSM and FSW in San Salvador in 2008 (Paz-Bailey et al, 2011). This 
approach required the distribution of tokens (e.g., key chains) throughout 
the population followed by a recapture step with a follow-up survey. Paz- 
Bailey et al. (2011) estimate that the size of the FSW population in San 
Salvador is almost double the UNAIDS figures (1.4%). This population 
proportion in Sonsonate would translate to 2079 FSW, close to the poste- 
rior mean of the proposed method for the flat prior, but in the upper tail 
of the posteriors based on priors based on the UNAIDS guidelines. Paz- 
Bailey et al. (2011) estimate that the size of the MSM population in San 
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Figure 6: Posterior distribution for the number of MSM in San Miguel 
based on three prior distributions for the population size: flat, matching 
the midpoint UNAIDS estimate, and interval-matching the UNAIDS esti- 
mate. The legend is the same as Figure 5. 
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Salvador is close to the UNAIDS figures (3.4%). This is somewhat high 
but comparable to the MSM results in Figure 6. Thus the results here are 
largely comparable to those of Paz-Bailey et al. (2011) when they differ 
from the UNAIDS guidelines. 

5 Discussion 

The primary contribution of this paper is a method to estimate population 
size from RDS data alone. All existing methods require at least two data 
sources, and strong assumptions about their dependence structure. Intu- 
itively when unit sizes are associated with sampling probability a system- 
atic decline in observed unit sizes over time is indicative of the depletion of 
the available population. As described in this paper, a successive sampling 
approximation to the RDS process leverages this change in observed sizes 
to estimate the size of the hidden population. These data were previously 
unexploited in the estimation of the size of hard-to-reach human popula- 
tions. Because RDS is designed for inference in hard-to-reach populations, 
such data often exist in precisely the populations where population size 
is unknown but of great interest. Thus this method provides additional 
important information, that is, an estimate of population size, at no extra 
cost. Furthermore, the Bayesian framework of this work allows for easy 
incorporation of informative prior information or data from other sources. 

The main limitation of the proposed method is the small amount of 
information on population size available in many RDS samples, with less 
information available in smaller sample fractions. In cases with little in- 
formation, this method results in very large interval estimates in order 
to obtain reasonable frequentist coverage properties. However, existing 
methods are also subject to great uncertainty, even though they typically 
require additional data collection. Often these methods lead to apparently 
conflicting results because they poorly estimate measures of uncertainty 
(Salganik et al., 2011). The advantage of the proposed method is that it 
uses existing data and accurately assesses the degree of uncertainty in TV 
over a wide range of practical situations. Results from the two cases in El 
Salvador provide estimates compatible with UNAIDS guideline estimates 
as well as capture/ recapture estimates of population sizes. 

This method is also useful for estimators of population characteris- 
tics that require an estimate of the population size. The simulation study 
demonstrates that using population size estimates from the proposed method 
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in the SS estimator (Gile, 2011) works well and is particularly helpful in 
conditions of strong differential activity and larger sample fractions. 

The framework developed in this paper is designed to be a foundation 
upon which other approaches to population size estimation can build. In 
particular it is designed to facilitate combination with data from multi- 
ple methods (e.g., direct surveys, capture-recapture, network scale-up and 
multiplier). The posterior from the approach in this paper can be used as 
a prior for combination with information from these alternative methods. 
Thus this methodology will lead to coherent inference that can be incre- 
mentally improved in a constructive way. 

While the methods in this paper have been applied to data collected 
via RDS, we note that the approach is general and applies to data col- 
lected via successive sampling. Hence the method has broad applicability 
(Andreatta and Kaufman, 1986; Bickel et al, 1992; Nair and Wang, 1989). 

The R package implementing the methods developed in this paper 
(Handcock, 2011) will be available on CRAN (R Development Core Team, 
2011). 

6 Supplementary Materials 

Inferential and Computation: This supplement presents specifics of the 
estimation algorithms, variations of models for the unit size distri- 
bution, a proof of the non-amenability of RDS, and the recruitment 
graph for the study of female sex workers in El Salvador. 
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Supplemental Materials: 
Estimating Hidden Population Size 
using Respondent-Driven Sampling Data 

Mark S. Handcock* Krista J. Gile t Corinne M. Mar* 

September 27, 2012 
S.l Algorithmic Details 

This section derives the algorithm used to compute the joint posterior for 
the population size, the super-population parameter rj and the unobserved 
population sizes U unobs . It is based on that developed by West (1996). 

First consider the case in Section 2.3 where the population size N is 
assumed known. The augmented posterior: 

p(v, U unobs = u unobs , V\D) (S.l) 

can be computed via a three component Gibbs sampler. Explicitly, the 
steps are: 

1. Initialize u unobs at a set of unit sizes. 

2. Sample i] from 

n 

p(v\U un0 bs = u unobs ,^,D) = n(v)-Ylf( u 9j\v) ■ II f( u j\v) (S-2) 

i =1 je\g 

3. Sample * from p(^- = ip\r], U unobs = u unobs , D) in equation (7). 

4. Sample U unobs from p(U unobs = u unobs \^, rj, D) in equation (8). 

5. Repeat steps (2) through (4) until convergence. 

* Professor of Statistics, Department of Statistics, University of California, Los Angeles, 
CA 90095-1554 (E-mail: handcock@ucla.edu). 

^Assistant Professor of Statistics, Department of Mathematics and Statistics, Univer- 
sity of Massachusetts, Amherst, MA 01003-9305 (E-mail: gile@math.umass.edu). 

*CSDE, University of Washington, Seattle, WA 98195-1554 (E-mail: cmmar@uw.edu). 



The MCMC should be run until burn-in and then sampled after an 
interval of iterations to produce a large sample from the augmented pos- 
terior (S.l). Standard MCMC diagnostics can be applied to assess conver- 
gence (Gilks et al., 1996). In our experience, the procedure is well behaved 
and converges quickly. The augmented posterior can be marginalized to 
produce samples from p(r]\D) and from the posterior predictive distribu- 
tion for the unobserved unit sizes: p(U unobs = u unobs \D). 

These in turn enable inference for such quantities as the mean unit size, 
the unit size distribution, etc. 

We make the following step-by-step remarks on the algorithm: 

1- u unobs ~ f(-\wo), where rj is a specified starting value. 

2. This is done with a Metropolis-Hastings algorithm with Gaussian 
proposal for the mean size and a inverse x for the standard deviation 
of the size. 

3. These are independent standard Exponential draws. 

4. This is done with a rejection algorithm (West, 1996). For each element 

of Uunobs . 

(a) Draw d ~ f(-\rj) an d independently, u ~ C/(0, 1). 

(b) If log(w) > — d then reject d and return to (a); otherwise save d as 
the element of u unobs and return to (a) for the next element. 

In the situation in Section 2.4 where iV is not known, we need to extend 
the above algorithm to sample from the joint posterior p(N, rj, U unobs , ^\D). 
The algorithm is: 

1. Initialize iV at a point estimate and u unobs at a set of unit sizes. 

2. Sample i] from 

n 

pijlWunobs = u unobs ,^,D, N) = n(rj)-Y[f(u 9j \rj) ■ J J fiu^r]) (S.3) 

3. Sample ^ from p(ipj = ip\r), U unobs = u unobs , D, N) in equation (7). 

4. Sample N from p{N\i], \l>, D) in equation (13). 
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5. Sample U unohs from p(U unobs = u unobs \^, 77, D, N) in equation (8). 

6. Repeat steps (2) through (5) until convergence. 

As before, this expanded MCMC can be run to produce a large sample 
from the augmented posterior: 

p(N, 77, Uunohs = u unobs , V\D) (S.4) 

This can then be marginalized to produce samples from p(N\D), p(r)\D) 
and the posterior predictive distribution of the unobserved unit sizes, p(U uno b s = 
u U nobs\D)- Hence it produces posterior predictive distributions of the full 
population of unit sizes {u^i = 1,...,N). 

The step-by-step remarks are the same as before. For step 4, the com- 
plete PMF is discrete and is computed directly from (13) and hence sam- 
pled from directly. 

Both these algorithms have been implemented at the C level in the R 
package size (Handcock, 2011). They are accessible via a user-friendly 
front-end. It includes a range of unit size distribution models, each of 
which is parametrized in terms of the mean and standard deviation of the 
distribution. 

S.2 Models for the Unit Size Distribution 

The methods in this paper require a parametric model for the super-population 
draws of the unit sizes. Although the overall mathematical framework is 
general, the case where the unit sizes are degrees is considered here. 

There is a substantial literature on models for the degree distributions 
of social networks (Handcock and Jones, 2004, 2006; Jones and Handcock, 
2003a,b). The naive models for these distributions include the Poisson and 
the Negative binomial (to allow Gamma over-dispersion relative to the 
Poisson). However, degree distribution tend to be long-tailed, suggest- 
ing alternatives such as the Yule and the Waring distributions (Jones and 
Handcock, 2003a), which allow power-law over-dispersion. Another in 
this class is the Poisson-log-normal, allowing log-normal over-dispersion 
(Perline, 2005). This allows longer-tails than the Negative Binomial but 
less than the power-law models. These models are unable to represent 
under-dispersion of the degree counts. The Conway-Maxwell-Poisson dis- 
tribution allows both under-dispersion and over dispersion with a single 
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additional parameter over a Poisson (Shmueli et al, 2005). It is possible 
to augment these distributions by directly parameterizing the lower-tail 
or upper-tail (e.g., f{l\rj), f(2\r)), J2T=2o f( k \v))- This can improve the fit. 
Each of these options was considered for this paper and are available in the 
R package degreenet on CRAN (Handcock, 2003; R Development Core 
Team, 2011). In many applications it is helpful to focus on two-parameter 
models which enable separate specification of the location and spread of 
the degree distribution (with the shape a feature of the model). 

S.3 Prior for the population size 

This issue is considered in Section 2.6 of the paper. In addition to the 
improper uniform prior, there are natural classes of parametric models 
for counts considered for the unit sizes above, (e.g., Negative Binomial, 
Poisson-log-normal, Conway-Maxwell-Poisson). 

Note that the data effectively truncates the prior below the sample size 

n. 

The most common parametric models for iV (e.g., Negative Binomial) 
typically have too thin tails for large N and require accurate specification 
of the prior mean and standard deviation. 

The class of priors proposed in Section 2.6 specifies prior knowledge 
about the sample proportion (i.e. n/N). This is based on the idea that the 
sample size may not be chosen separately from the population size but is 
often chosen in relation to it. So a simple prior is a uniform prior on the 
sampling proportion (i.e. the proportion of the population in the sample). 
This translates to a closed form for the prior on N which has infinite mean 
(and higher moments) and allows for very large population sizes. To allow 
a more flexible prior, we specify a Beta(o;, (3) distribution on the sample 
proportion. This is translated to a prior on the discrete support of n/N by 
assigning the probability mass to the closest discrete value. 

A uniform distribution on the sample proportion corresponds to a me- 
dian of twice the sample size. While the mapping from the mean to (5 does 
not have a closed from (for general (3) it can be easily computed numeri- 
cally. We have found the sub-class with a = 1 to be the most useful. For 
these the mode of the prior is at 0.5n(/3 + 1) and the median is given by 
n/(l - (1/2)^). 
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S.4 RDS is not amenable to the model 



The RDS design is not amenable to the modeling framework used in this 
paper, and here we explain why. In Section 2.2, modeling the sampling 
design along with the resulting data structure adds a great deal of com- 
plexity (see, e.g., (4)). It is natural to ask when we might consider the 
simpler face-value likelihood, 

n N 

L[r]\U obs = u obs ] oc 

i =1 u unobs £U(u obs ) j=n+l 

which ignores the sampling design. 

The sampling design is amenable, in the sense of Handcock and Gile 
(2010), if: 

P{G 9obs\U bs Uobs-i U uno l) S U unobs , Tj) P{G 9obs\U bs u obs)- 

If the sampling design is adaptive: 

L[v\U bs = Uobs, G = g obs ] oc P(G = g bs\U obs = u obs ) ■ L[r]\U obs = u obs ] 

Thus if the sampling design is adaptive then the sampling design is 
ignorable in the sense that the resulting likelihoods are proportional (Little 
and Rubin, 2002). When this condition is satisfied likelihood-based in- 
ference for i], as proposed here, is unaffected by the (possibly unknown) 
sampling design. 

However, based on equation (2), the sequential sampling design is 
clearly not adaptive. Likelihood inference should be based on the full like- 
lihood given by equation (1). 

S.5 Recruitment Graph of the Female Sex Work- 
ers in Sonsonate 

Figure S.l is a graph of the recruitment tree for RDS of female sex workers 
in the department of Sonsonate, El Salvador. The data is considered in 
depth in Section 4. 
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Figure S.l: Graphical representation of the recruitment tree for the sam- 
pling of female sex workers in Sonsonate, El Salvador in 2010. The nodes 
are the respondents and the wave number increases as you go down the 
page. The node gray scale is proportional to the network size reported by 
the worker, with white being degree one and black the maximum degree. 
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